Phylogenetic Networks Do not Need to Be Complex: 
Using Fewer Reticulations to Represent Conflicting Clusters 



Leo van Iersel^1, Steven Kelk^2, Regula Rupp^3 and Daniel Huson^3 

^1 University of Canterbury, Department of Mathematics and Statistics, 
Private Bag 4800, Christchurch, New Zealand, l.j.j.v.iersel@gmail.com 
^2 Centrum voor Wiskunde en Informatica (CWI) 
P.O. Box 94079, 1090 GB Amsterdam, The Netherlands, s.m.kelk@cwi.nl 
^3 Center for Bioinformatics ZBIT, Tübingen University 
Sand 14, 72076 Tübingen, Germany. {huson,rrupp}@informatik.uni-tuebingen.de 



Abstract Phylogenetic trees are widely used to display estimates of how groups of species evolved. 
Each phylogenetic tree can be seen as a collection of clusters, subgroups of the species that evolved from 
a common ancestor. When phylogenetic trees are obtained for several data sets (e.g. for different genes), 
then their clusters are often contradicting. Consequently, the set of all clusters of such a data set cannot 
be combined into a single phylogenetic tree. Phylogenetic networks are a generalization of phylogenetic 
trees that can be used to display more complex evolutionary histories, including reticulate events such as 
hybridizations, recombinations and horizontal gene transfers. Here we present the new Cass algorithm 
that can combine any set of clusters into a phylogenetic network. We show that the networks constructed 
by Cass are usually simpler than networks constructed by other available methods. Moreover, we 
show that Cass is guaranteed to produce a network with at most two reticulations per biconnected 
component, whenever such a network exists. We have implemented Cass and integrated it in the freely 
available Dendroscope software. 



1 Introduction 

Phylogenetics studies the reconstruction of evolutionary histories from genetic data of currently living or- 
ganisms. A (rooted) phylogenetic tree is a representation of such an evolutionary history in which species 
evolve by mutation and speciation. The leaves of the tree represent the species under consideration and the 
root of the tree represents their most recent common ancestor. Each internal node represents a speciation: 
one species splits into several new species. Thus, mathematically speaking, such a node has indegree one 
and outdegree at least two. In recent years, a lot of work has been done on developing methods for comput- 
ing (rooted) phylogenetic networks [5,10], which form a generalization of phylogenetic trees. Next to nodes 
representing speciation, rooted phylogenetic networks can also contain reticulations: nodes with indegree at 
least two. Such nodes can be used to represent recombinations, hybridizations or horizontal gene transfers, 
depending on the biological context. In addition, phylogenetic networks can also be interpreted in a more 
abstract sense, as a visualization of contradictory phylogenetic information in a single diagram. 

Suppose we wish to investigate the evolution of a set X of taxa (e.g. species or strains). Each edge 
of a rooted phylogenetic tree represents a cluster: a proper subset of the taxon set X. In more detail, an 
edge (u, v) represents the cluster containing those taxa that are descendants of v. Each phylogenetic tree T 
is uniquely defined by the set of clusters represented by T. Phylogenetic networks also represent clusters. 
Each of their edges represents one "hardwired" and at least one "softwired" cluster. An edge (u, v) of a 
phylogenetic network represents a cluster C ⊂ X in the hardwired sense if C equals the set of taxa that are 
descendants of v. Furthermore, (u, v) represents C in the softwired sense if C equals the set of all taxa that 
can be reached from v when, for each reticulation r, exactly one incoming edge of r is "switched on" and the 
other incoming edges of r are "switched off". An equivalent definition states that a phylogenetic network N 
represents a cluster C in the softwired sense if there exists a tree T that is displayed by N (formally defined 
below) and represents C. In this paper we will always use "represent" in the softwired sense. It is usually the 



clusters in a tree that are of more interest, and less the actual trees themselves, as clusters represent putative 
monophyletic groups of related species. For a complete introduction to clusters see Huson and Rupp [10]. 
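
The softwired definition can be made concrete with a small sketch. The Python function below is only an illustration under assumptions: the network is given as a list of directed edges, `softwired_clusters` is a hypothetical helper name, the toy network is not taken from this paper, and every combination of switched-on reticulation edges is enumerated by brute force, which is exponential in the number of reticulations.

```python
from itertools import product


def softwired_clusters(edges, leaves):
    """Return all clusters represented in the softwired sense.

    edges  -- list of directed edges (u, v) of a rooted network
    leaves -- set of leaf labels (the taxa)
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    # Reticulations are the nodes with indegree at least two.
    reticulations = [v for v, ps in parents.items() if len(ps) >= 2]

    clusters = set()
    # Switch on exactly one incoming edge per reticulation.
    for choice in product(*(parents[r] for r in reticulations)):
        switched_on = dict(zip(reticulations, choice))
        children = {}
        for u, v in edges:
            if v not in switched_on or switched_on[v] == u:
                children.setdefault(u, []).append(v)

        def taxa_below(node):
            if node in leaves:
                return frozenset([node])
            below = frozenset()
            for child in children.get(node, []):
                below |= taxa_below(child)
            return below

        # Every non-root node v is the head of some edge (u, v).
        for v in parents:
            cluster = taxa_below(v)
            if cluster:
                clusters.add(cluster)
    return clusters


# Toy network (an assumption): leaf x hangs below a reticulation h
# whose two parents are a and b.
edges = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
         ("h", "x"), ("a", "y"), ("b", "z")]
clusters = softwired_clusters(edges, {"x", "y", "z"})
print(sorted("".join(sorted(c)) for c in clusters))
```

In this toy network the clusters {x, y} and {x, z} are both represented, although no single tree can represent both of them.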

In phylogenetic analysis, it is common to compute phylogenetic trees for more than one data set. For 
example, a phylogenetic tree can be constructed for each gene separately, or several phylogenetic trees can 
be constructed using different methods. To accurately reconstruct the evolutionary history of all considered 
taxa, one would preferably like to use the set C of all clusters represented by at least one of the constructed 
phylogenetic trees. In general however, some of the clusters of the different trees will be conflicting, which 
means that there will be no single phylogenetic tree representing C. Therefore, several recent publications 
have studied the construction of a phylogenetic network representing C. Huson and Rupp [11] describe how 
a phylogenetic network can be constructed that represents C in the hardwired sense (a cluster network). A 
network is a galled network if it contains no path between two reticulations that is contained in a single 
biconnected component (a maximal subgraph that cannot be disconnected by removing a single node). Huson 
and Klöpper [8] and Huson et al. [12] describe an algorithm for constructing a galled network representing C 
in the softwired sense. 

Related literature describes the construction of phylogenetic networks from phylogenetic trees or triplets 
(phylogenetic trees on three taxa) . A tree or triplet T is displayed by a network N if there is a subgraph T' 
of N that is a subdivision of T (i.e. T' can be obtained from T by replacing edges by directed paths). 
Computing the minimum number of reticulations required in a phylogenetic network displaying two input 
trees (on the same set of taxa) was shown to be APX-hard by Bordewich and Semple [2]. Bordewich et al. [1] 
proposed an exact exponential-time algorithm for this problem and Linz and Semple [18] showed that it is 
fixed parameter tractable (FPT), if parameterized by the minimum number of reticulations. The downside 
of these algorithms is that they are very rigid in the sense that one generally needs very complex networks 
in order to display the given trees. 

The level of a binary network is the maximum number of reticulations in a biconnected component and 
thus provides a measure of network complexity. Given an arbitrary number of trees on the same set of taxa, 
Huynh et al. [13] describe a polynomial-time algorithm that constructs a level-1 phylogenetic network that 
displays all trees and has a minimum number of reticulations, if such a network exists (which is unlikely 
in practice). Given a triplet for each combination of three taxa, Jansson, Sung and Nguyen [16,17] give a 
polynomial-time algorithm that constructs a level-1 network displaying all triplets, if such a network exists. 
The algorithm by van Iersel and Kelk [15] can be used to find such a network that also minimizes the number 
of reticulations. These results have later been extended to level-2 [14,15] and more recently to level-k, for 
all k ∈ N [19]. Although this work on triplets is theoretically interesting, it has the practical drawback that 
biologists are not interested in triplets (but rather in trees or clusters) and that these algorithms need a 
triplet for each combination of three taxa as input, while some triplets might be difficult to derive correctly. 

In this article, we present the algorithm Cass^, which takes any set C of clusters as input and constructs 
a phylogenetic network that represents C. Furthermore, the algorithm aims at minimizing the level of the 
constructed network and in this sense Cass is the first algorithm to combine the flexibility of clusters with 
the power of level minimization. Cass constructs a phylogenetic tree representing C whenever such a tree 
exists. Moreover, we prove that Cass constructs a level-1 or level-2 network representing C whenever there 
exists a level-1 or level-2 network representing C, respectively. Experimental results show that also when no 
level-2 network representing C exists, Cass usually constructs a network with a significantly lower level and 
lower number of reticulations compared to other algorithms. In fact, we conjecture that similar arguments as 
in our proof for level-2 can be used to show that Cass always constructs a level-k network with minimum k. 
We prove a decomposition theorem for level-k networks that supports this conjecture. Finally, we prove that 
Cass runs in polynomial time if the level of the output network is bounded by a constant. 

We have implemented Cass and added it to our popular tree-drawing program Dendroscope [9], where 
it can be used as an alternative for the cluster network [11] and galled network [12] algorithms. Experiments 
show that, although Cass needs more time than these other algorithms, it constructs a simpler network 
representing the same set of clusters. For example, Figure 1(a) shows a set of clusters and the galled network 

* In Section 2 we generalize the notion of level to non-binary networks. 
^ Named after the Cass Field Station in New Zealand. 

Figure 1. (a) The output of the galled network algorithm [12] for C = {{a, b, f, g, i}, {a, b, c, f, g, i}, {a, b, f, i}, 
{b,c,f,i}, {c,d,e,h}, {d,e,h}, {b,c,f,h,i}, {b,c,d, f,h,i}, {b,c,i}, {a,g}, {b,i}, {c,i}, {d,h}} and (b) the 
network constructed by Cass for the same input. 

with four reticulations constructed by the algorithm in [12]. However, for this data set also a level-2 network 
with two reticulations exists, and Cass can be used to find this network, see Figure 1(b). Dendroscope now 
combines the powers of Cass and the two previously existing algorithms for constructing galled- and cluster 
networks. 

2 Level-k Networks and Clusters 

Consider a set X of taxa. A (phylogenetic) network (on X) is a directed acyclic graph with a single root 
and leaves bijectively labeled by X. The indegree of a node v is denoted $\delta^-(v)$ and v is called a reticulation 
if $\delta^-(v) \geq 2$. An edge (u, v) is called a reticulation edge if its head v is a reticulation and is called a tree edge 
otherwise. The reticulation number of a phylogenetic network N = (V, E) is defined as 

$$\sum_{v \in V :\, \delta^-(v) > 0} \left( \delta^-(v) - 1 \right) = |E| - |V| + 1 .$$
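
As a quick check of this identity, the following sketch computes the reticulation number in both ways for a small network. The edge-list representation, the function names and the toy network are assumptions made for illustration only.

```python
from collections import Counter


def reticulation_number(edges):
    """Sum of (indegree - 1) over all nodes with positive indegree."""
    indegree = Counter(v for _, v in edges)
    return sum(d - 1 for d in indegree.values())


def reticulation_number_by_counting(edges):
    """Equivalent count |E| - |V| + 1 for a network with a single root."""
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    return len(edges) - len(nodes) + 1


# Toy network with one reticulation h (parents a and b).
edges = [("r", "a"), ("r", "b"), ("a", "h"), ("b", "h"),
         ("h", "x"), ("a", "y"), ("b", "z")]
print(reticulation_number(edges), reticulation_number_by_counting(edges))
```

The two counts agree because every node except the single root contributes its indegree to |E| and one unit to |V|.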

A directed acyclic graph is connected (also called "weakly connected") if there is an undirected path 
(ignoring edge orientations) between each pair of nodes. A node (edge) of a directed graph is called a 
cut-node (cut-edge) if its removal disconnects the graph. A directed graph is biconnected if it contains no 
cut-nodes. A biconnected subgraph B of a directed graph G is said to be a biconnected component if there 
is no biconnected subgraph B' ≠ B of G that contains B. 

A phylogenetic network is said to be a level-k network if each biconnected component has reticulation 
number at most k.^ A phylogenetic network is called binary if each node has either indegree at most one and 
outdegree at most two or indegree at most two and outdegree at most one. Note that the above definition of 
level generalizes the original definition [3] for binary networks. A level-k network is called a simple level-≤k 
network if the head of each cut-edge is a leaf. A simple level-≤k network is called a simple level-k network if 
its reticulation number is precisely k. A phylogenetic tree (on X) is a phylogenetic network (on X) without 
reticulations, i.e. a level-0 network. 

Consider a set of taxa 𝒳. Proper subsets of 𝒳 are called clusters. We say that two clusters C_1, C_2 ⊂ 𝒳 
are compatible if either C_1 ∩ C_2 = ∅ or C_1 ⊆ C_2 or C_2 ⊆ C_1. Consider a set of clusters 𝒞. We say that 
a set of taxa X ⊆ 𝒳 is separated (by 𝒞) if there exists a cluster C ∈ 𝒞 that is incompatible with X. The 
incompatibility graph IG(𝒞) of 𝒞 is the undirected graph (V, E) that has node set V = 𝒞 and edge set 

E = { {C_1, C_2} | C_1 and C_2 are incompatible clusters in 𝒞 } . 
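
These definitions translate almost directly into code. The sketch below builds the incompatibility graph and returns its connected components; the function names are hypothetical, and treating a component as "nontrivial" when it contains at least two clusters is our reading of the term, not something this section states explicitly.

```python
from itertools import combinations


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a


def nontrivial_components(clusters):
    """Connected components of the incompatibility graph that contain
    at least two clusters (assumed meaning of "nontrivial")."""
    clusters = [frozenset(c) for c in clusters]
    adjacent = {c: set() for c in clusters}
    for a, b in combinations(clusters, 2):
        if not compatible(a, b):
            adjacent[a].add(b)
            adjacent[b].add(a)

    seen, components = set(), []
    for start in clusters:
        if start in seen:
            continue
        # Depth-first search from an unvisited cluster.
        stack, component = [start], set()
        while stack:
            c = stack.pop()
            if c in component:
                continue
            component.add(c)
            stack.extend(adjacent[c] - component)
        seen |= component
        if len(component) > 1:
            components.append(component)
    return components


example = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"d", "e"}]
print(nontrivial_components(example))
```

Here only {a, b} and {b, c} conflict, so they form the single nontrivial component; {a, b, c} and {d, e} are compatible with every other cluster.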

^ Note that to determine the reticulation number of a biconnected component one only counts edges inside this 
biconnected component. 



3 Decomposing Level-k Networks 



In this section, we describe the general outline of our algorithm Cass. We show how the problem of 
determining a level-k network can be decomposed into a set of smaller problems by examining the incompatibility 
graph. Our algorithm will first construct a simple level-≤k network for each connected component of the 
incompatibility graph and subsequently merge these simple level-≤k networks into a single level-k network 
on all taxa. 

Consider a set of taxa 𝒳 and a set 𝒞 of input clusters. We assume that all singletons (sets {x} with x ∈ 𝒳) 
are clusters in 𝒞. Our algorithm proceeds as follows. 

Step 1. Find the nontrivial connected components C_1, ..., C_p of the incompatibility graph IG(𝒞). For each 
i ∈ {1, ..., p}, let C_i' be the result of collapsing unseparated sets of taxa as follows. Let X_i = ⋃_{C ∈ C_i} C. For each 
maximal subset X ⊂ X_i that is not separated by C_i, replace, in each cluster in C_i, the elements of X by a 
single new taxon X, e.g. if X = {b, c} then a cluster {a, b, c, d} is modified to {a, {b, c}, d}. 

Step 2. For each i ∈ {1, ..., p}, construct a simple level-≤k network N_i representing C_i'. 

Step 3. Let C* be the result of applying the following modifications to 𝒞, for each i ∈ {1, ..., p}: remove all 
clusters that are in C_i, add a cluster X_i and add each maximal subset X ⊂ X_i that is not separated by C_i. 
Construct the unique phylogenetic tree T on 𝒳 representing precisely those clusters in C*. 

Step 4. For each i ∈ {1, ..., p}, replace in T the lowest common ancestor v_i of X_i by the simple level-≤k 
network N_i as follows. Delete all edges leaving v_i and merge T with N_i by identifying the root of N_i with v_i 
and identifying each leaf of N_i labeled X by the lowest common ancestor of the leaves labeled X in T. 
Output the resulting network. 
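
Step 3 relies on the fact that a set of pairwise compatible clusters determines a unique tree. A minimal sketch of that construction follows, assuming clusters are given as Python sets and the tree is returned as a child-to-parent map; `tree_from_clusters` is a hypothetical name and this is not the Dendroscope implementation.

```python
from itertools import combinations


def compatible(a, b):
    """Two clusters are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a


def tree_from_clusters(clusters, taxa):
    """Return a child -> parent map of the tree whose clusters are exactly
    the given ones (plus all singletons), or None if two clusters conflict.
    Each tree node is identified with the set of taxa below it."""
    root = frozenset(taxa)
    nodes = {frozenset(c) for c in clusters} | {frozenset([x]) for x in taxa}
    nodes.discard(root)
    nodes.discard(frozenset())
    if any(not compatible(a, b) for a, b in combinations(nodes, 2)):
        return None
    parent = {}
    for c in nodes:
        supersets = [d for d in nodes if c < d]
        # The parent is the smallest cluster strictly containing c;
        # compatibility guarantees that the supersets are nested.
        parent[c] = min(supersets, key=len) if supersets else root
    return parent


parent = tree_from_clusters([{"a", "b"}, {"a", "b", "c"}], {"a", "b", "c", "d"})
print(sorted("".join(sorted(p)) for p in set(parent.values())))
```

When the function returns None, no single tree represents the clusters, which is the situation in which the incompatibility graph has a nontrivial component.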

Notice that Steps 1,3 and 4 are similar to the corresponding steps in algorithms for constructing galled 
trees (i.e. level-1 networks) and galled networks [8,10,12]. The reason why we use the same set-up in our 
algorithm, is outlined by Theorem 1. It shows that, when constructing a level-k network displaying a set of 
clusters, we can restrict our attention to level-k networks that satisfy the decomposition property [10], the 
definition of which we repeat below. 

Because a cluster C ∈ 𝒞 can be represented by more than one edge in a network N, an edge assignment ε 
is defined as a mapping that chooses for each cluster C ∈ 𝒞 a single tree edge ε(C) of N that represents C. 
A network N representing 𝒞 is said to satisfy the decomposition property w.r.t. 𝒞 if there exists an edge 
assignment ε such that: 

• for any two clusters C_1, C_2 ∈ 𝒞, the edges ε(C_1) and ε(C_2) are contained in the same biconnected 
component of N if and only if C_1 and C_2 lie in the same connected component of the incompatibility 
graph IG(𝒞). 

Theorem 1. Let 𝒞 be a set of clusters. If there exists a level-k network representing 𝒞, then there also exists 
such a network satisfying the decomposition property w.r.t. 𝒞. 

Proof. Let 𝒞 be a set of input clusters and N a level-k network representing 𝒞. Let C_1, ..., C_p be the nontrivial 
connected components of the incompatibility graph IG(𝒞). For each i ∈ {1, ..., p}, we construct a simple 
level-≤k network N_i as follows. Let X_i = ⋃_{C ∈ C_i} C as before. For each maximal subset X ⊂ X_i (with |X| > 1) 
that is not separated by C_i, replace in N an arbitrary leaf labeled by an element of X by a leaf labeled X 
and remove all other leaves labeled by elements of X. In addition, remove all leaves with labels that are not 
in X_i. We tidy up the resulting graph by repeatedly applying the following five steps until none is applicable: 
(1) delete unlabeled nodes with outdegree 0; (2) suppress nodes with indegree and outdegree 1 (i.e. contract 
one edge incident to the node); (3) replace multiple edges by single edges; (4) remove the root if it has 
outdegree 1; and (5) contract biconnected components that have only one outgoing edge. This leads to a 
level-k network N_i. Let C_i' be defined as in Step 1 of the algorithm. By its construction, N_i represents C_i'. 
Furthermore, N_i is a simple level-≤k network, because if it contained a cut-edge e whose head is not 
a leaf, then the set of taxa labeling leaves reachable from e would not be separated by C_i' and would hence 
have been collapsed. Finally, the networks N_1, ..., N_p can be merged into a level-k network representing 𝒞 
and satisfying the decomposition property by executing Steps 3 and 4 of the algorithm. □ 

Note that the analogous statement obtained by replacing "level-k network" by "network with k reticulations" 
does not hold, as shown in [12], based on [7]. 

4 Simple Level-k Networks 

This section describes how one can construct a simple level-k network representing a given set of clusters. We 
say that a phylogenetic tree T is a strict subtree of a network N if T is a subgraph of N and for each node v 
of T, except its root, it holds that the in- and outdegree of v in T are equal to the in- and (respectively) 
outdegree of v in N. 

Informally, our method for constructing simple level-k networks operates as follows. We loop through all 
taxa x. For each choice of x, we remove it from each cluster and subsequently collapse all maximal "ST-sets" 
("strict tree sets", defined below) of the resulting cluster set. We repeat this step k times. The idea behind 
this strategy is as follows. Observe that any simple level-k network N contains a leaf whose parent is a 
reticulation. If we removed this leaf and reticulation from N, the resulting network might contain one 
or more strict subtrees. Each such strict subtree corresponds to an ST-set. Moreover, for the case k ≤ 2 
we prove that (without loss of generality) each maximal strict subtree corresponds to a maximal ST-set. 
Collapsing each maximal strict subtree of the network would lead to a (not necessarily simple) level-(k − 1) 
network, which would again contain a leaf whose parent is a reticulation. It follows that we can indeed repeat 
the described steps k times, after which all leaves will be collapsed into just two taxa and the second phase 
of the algorithm starts. 

We create a network consisting of a root with two children, labeled by the only two taxa. Then we 
"decollapse", i.e. we replace each leaf labeled by an ST-set by a strict subtree. Subsequently we add a new 
leaf below a new reticulation and label it by the latest removed taxon. Since we do not know where to create 
the new reticulation, we try adding the reticulation below each pair of edges. For each constructed simple 
level-k network, we check whether it represents all input clusters. If it does, we output the resulting network, 
after contracting any edges that connect two reticulations. 

Let us now formalize this algorithm. Given a set S ⊆ 𝒳 of taxa, we use 𝒞 \ S to denote the result of 
removing all elements of S from each cluster in 𝒞 and we use 𝒞|S to denote 𝒞 \ (𝒳 \ S) (the restriction of 𝒞 
to S). We say that a set S ⊆ 𝒳 is an ST-set (strict tree set) w.r.t. 𝒞, if S is not separated by 𝒞 and any two 
clusters C_1, C_2 ∈ 𝒞|S are compatible. An ST-set S is maximal if there is no ST-set T with S ⊂ T. Informally, 
the maximal ST-sets are the result of repeatedly collapsing pairs of unseparated taxa as long as possible. 

We use Collapse(𝒞) to denote the result of collapsing each maximal ST-set S into a single taxon S. 
More precisely, for each cluster C ∈ 𝒞 and maximal ST-set S of 𝒞, we replace C by C \ S ∪ {{S}}. For 
example (omitting singleton clusters), if 

𝒞 = { {1,2}, {2,3,4}, {3,4} } , 

then {3,4} is the only nonsingleton maximal ST-set and 

Collapse(𝒞) = { {1,2}, {2, {3,4}} } . 

The set of taxa of a (collapsed) cluster set 𝒞 is denoted 𝒳(𝒞). Thus, for the above example, 𝒳(Collapse(𝒞)) = 
{1, 2, {3,4}}. We are now ready to give the pseudocode of Cass(k) in Algorithm 1. The actual implementation 
is slightly more complex and much more space efficient. 
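
The example above can be reproduced with a brute-force sketch. The code below is an illustration only: the function names are hypothetical, all nonempty subsets of the taxa are enumerated (exponential time, unlike the actual implementation), a collapsed ST-set is used directly as the new taxon as in the example, and only clusters that intersect the ST-set are modified.

```python
from itertools import combinations


def compatible(a, b):
    """Two sets are compatible if they are disjoint or nested."""
    return not (a & b) or a <= b or b <= a


def is_st_set(s, clusters):
    """S is an ST-set if no cluster is incompatible with S and the
    clusters restricted to S are pairwise compatible."""
    if any(not compatible(s, c) for c in clusters):
        return False
    restricted = {c & s for c in clusters if c & s}
    return all(compatible(a, b) for a, b in combinations(restricted, 2))


def maximal_st_sets(clusters, taxa):
    """Brute force over all nonempty subsets of the taxa."""
    taxa = list(taxa)
    st_sets = [frozenset(sub)
               for size in range(1, len(taxa) + 1)
               for sub in combinations(taxa, size)
               if is_st_set(frozenset(sub), clusters)]
    return [s for s in st_sets if not any(s < t for t in st_sets)]


def collapse(clusters, taxa):
    """Replace every nonsingleton maximal ST-set by a single new taxon."""
    big = [s for s in maximal_st_sets(clusters, taxa) if len(s) > 1]
    collapsed = set()
    for c in clusters:
        new = set(c)
        for s in big:
            if c & s:
                new = (new - s) | {s}
        collapsed.add(frozenset(new))
    return collapsed


clusters = [frozenset({1, 2}), frozenset({2, 3, 4}), frozenset({3, 4})]
result = collapse(clusters, {1, 2, 3, 4})
# Print only the nonsingleton clusters, as in the example in the text.
print([sorted(c, key=repr) for c in result if len(c) > 1])
```

The cluster {3, 4} itself becomes a singleton after collapsing and is therefore omitted from the printed result, in line with the convention of the example.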

Figure 2 shows an example of how the Cass(2) algorithm constructs a simple level-2 network. We will now 
show that Cass(1) and Cass(2) will indeed construct a simple level-1 and a simple level-2 network, respectively, 
whenever this is possible. 

Figure 2. Construction of a simple level-2 network by the Cass algorithm. The edges e_1, e_2 that will be 
subdivided are colored red. Singleton clusters have been omitted, as well as the last collapse-step, for simplicity. 



Lemma 1. Given a set of clusters 𝒞, such that IG(𝒞) is connected and any X ⊂ 𝒳 is separated, Cass(1) 
and Cass(2) construct a simple level-1 respectively a simple level-2 network representing 𝒞, if such a network 
exists. 



Proof. The general idea of the proof is as follows. Details have been omitted due to space constraints. 
Assume k ≤ 2. It is clear that any (simple) level-k network N contains a reticulation r with a leaf, say 
labeled x, as child. Let N \ {x} denote the network obtained by removing the reticulation r and the leaf 
labeled x from N. This network might contain one or more strict subtrees. By the definition of ST-set, the 
set of leaf-labels of each maximal strict subtree corresponds to an ST-set w.r.t. 𝒞 \ {x}. However, in general 
not each such set needs to be a maximal ST-set. This is critical, because the total number of ST-sets can 
be exponentially large. Therefore, the main ingredient of our proof is the following. We show that whenever 
there exists a simple level-k network representing 𝒞, there exists a simple level-k network N' representing 𝒞 
such that the sets of leaf-labels of the maximal strict subtrees of N' \ {x} are the maximal ST-sets w.r.t. 
𝒞 \ {x}, with x the label of some leaf whose parent is a reticulation in N'. This is clearly true for k = 1. 
For k = 2 we sketch our proof below. 

Let us first mention that the actual algorithm is slightly more complicated than the pseudocode in 
Algorithm 1. Firstly, when Cass(k) constructs a tree, it adds a new "dummy" root to this tree and creates 
an edge from this dummy root to the old root. Such a dummy root is removed before outputting a network. 
Secondly, whenever the algorithm removes a dummy taxon d, it makes sure that it does not collapse in the 
previous step. 

Suppose there exists some level-2 network representing C. It can be shown that any such network is simple 
and that there exists at least one binary such network, say N. Since N is a binary simple level-2 network, 
there are only four possibilities for the structure of N (after removing leaves), see [14]. These structures are 
called generators. In each case, N \ {x} contains at most two maximal strict subtrees that have more than 



Algorithm 1 Cass(k): constructing a simple level-k network from clusters 

input (𝒞, 𝒳, k, k') 
output Cass(𝒞, 𝒳, k, k') 
// in the initial call to the algorithm, k' = k 
𝒩 := ∅ 
if k' = 0 then 
    return the unique tree representing exactly those clusters in 𝒞 or return ∅ if no such tree exists 
for x ∈ 𝒳 ∪ {d} do 
    // d is a dummy taxon not in 𝒳 
    remove leaf: 𝒞' := 𝒞 \ {x} 
    collapse: 𝒞'' := Collapse(𝒞') 
    recurse: 𝒩' := Cass(𝒞'', 𝒳(𝒞''), k, k' − 1) 
    for N' ∈ 𝒩' do 
        decollapse: replace each leaf of N' labeled by a maximal ST-set S w.r.t. 𝒞' by the tree on S 
            representing exactly those clusters in 𝒞'|S 
        for each pair of edges e_1, e_2 do 
            add leaf below reticulation: create a reticulation t, a leaf l labeled x and an edge from t to l; 
            for i = 1, 2, insert a node v_i into e_i and add an edge from v_i to t, this gives network N 
            if N represents 𝒞 then 
                save network: 𝒩 := 𝒩 ∪ {N} 
if k = k' then 
    return any simple level-k network in 𝒩, after removing each leaf labeled d and contracting each 
        edge connecting two reticulations 
else 
    return 𝒩 



one leaf. Furthermore, N \ {x} contains exactly one reticulation r', below which hangs a strict subtree 
with set of leaf-labels Xr (possibly, |Xr| = 1 or |Xr| = 0). 

First we assume that Xr is not a maximal ST-set w.r.t. C \ {x}. In that case it follows that there is 
some maximal ST-set X that contains Xr and also contains at least one taxon labeling a leaf ℓ that is not 
reachable by a directed path from the reticulation of N \ {x}. We can replace ℓ by a strict subtree on X that 
represents 𝒞|X. Such a tree exists because X is an ST-set. We remove all leaves that are labeled by elements of X 
and are not in this strict subtree. Since there are now no leaves left below the reticulation, we can remove 
this reticulation as well. It is easy to see that the resulting network is a tree representing C \ {x}. Moreover, 
we show that in each case a leaf labeled x can be added below a new reticulation (possibly with indegree 3) 
in order to obtain a network N' that represents C. Since N' contains just one reticulation, it is clear that 
the maximal strict subtrees of N' \ {x} are the maximal ST-sets w.r.t. C \ {x}. Cass(2) reconstructs such 
a network with an indegree-3 reticulation by removing x, removing a dummy taxon d, constructing a tree, 
adding a leaf labeled d below a reticulation, adding a leaf labeled x below a reticulation, removing the leaf 
labeled d and contracting the (now redundant) edges between the two reticulations. Note that this works 
because Cass(2) does not collapse in this case. 

It remains to consider the possibility that Xr is a maximal ST-set w.r.t. C \ {x}. In this case we modify 
network N to N' in such a way that also the other maximal ST-sets w.r.t. C \ {x} appear as the leaf-sets of 
strict subtrees in N' \ {x}. We again use a case analysis to show that this is always possible in such a way 
that the resulting network N' represents C. □ 

Lemma 2. Cass runs in time $O(|\mathcal{X}|^{3k+2} \cdot |\mathcal{C}|)$, if k is fixed. 

Proof. Omitted due to space constraints. □ 

Theorem 2. Given a set of clusters C, Cass constructs in polynomial time a level-2 network representing C, 
if such a network exists. 



Proof. Follows from Lemmas 1 and 2 and Theorem 1. 

Table 1. Results of Cass compared to GalledNetwork for several example cluster sets with |C| clusters 
and \X\ taxa. For each algorithm, the level k and reticulation number r of the output network are given as 
well as the running time t in minutes m and seconds s on a 1.67 GHz 2 GB laptop. The last row gives the 
average values. 



We conclude this section by showing that for each r ≥ 2, there exists a set of clusters C_r such that any 
galled network representing C_r needs at least r reticulations, while Cass constructs a network with just two 
reticulations, which also represents C_r. This follows from the following lemma. 

Lemma 3. For each r ≥ 2, there exists a set C_r of clusters such that there exists a network with two 
reticulations that represents C_r while any galled network representing C_r contains at least r reticulations. 

Proof. Omitted due to space constraints. □ 
5 Practice 

Our implementation of the Cass algorithm is available as part of the Dendroscope program [9]. To use 
Cass, first load a set of trees into Dendroscope. Subsequently, run the algorithm by choosing "options" 
and "network consensus". The program gives you the option of entering a threshold percentage t. Only 
clusters that appear in more than t percent of the input trees will be used as input for Cass. Choose 
"minimal network" to run the Cass algorithm to construct a phylogenetic network representing all clusters 
that appear in more than t percent of the input trees. 
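
The threshold step described above can be sketched as follows. This is not Dendroscope code: the function name and the input format (one collection of clusters per input tree) are assumptions made for illustration.

```python
from collections import Counter


def clusters_above_threshold(tree_clusters, t):
    """Keep the clusters that occur in more than t percent of the trees.

    tree_clusters -- one collection of clusters per input tree
    t             -- threshold percentage
    """
    counts = Counter()
    for clusters in tree_clusters:
        # Count each cluster at most once per tree.
        counts.update({frozenset(c) for c in clusters})
    n = len(tree_clusters)
    return {c for c, k in counts.items() if 100 * k > t * n}


trees = [
    [{"a", "b"}, {"a", "b", "c"}],
    [{"a", "b"}, {"b", "c"}],
    [{"a", "b"}, {"a", "b", "c"}],
]
kept = clusters_above_threshold(trees, 50)
print(sorted("".join(sorted(c)) for c in kept))
```

With t = 0 every cluster that occurs in at least one tree is kept, which corresponds to the setting used for the experiments reported in this section.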

Cass computes a solution for each biconnected component separately. If the computations for a certain 
biconnected component take too long, you can choose to "skip" the component, in which case the program 
will quickly compute the cluster network [11] for this biconnected component, instead. Alternatively, you 
can choose to construct a galled network, or to increase the threshold percentage t. For more information on 
using Dendroscope, see [9]. 

We have tested Cass on both practical and artificial data and compared Cass to other programs. The 
results (using t = 0) are in Tables 1 and 2. For the former table, several example data sets have been used, 
which have been selected in such a way as to obtain a good variation in number of taxa, number of clusters 
and network complexity. For each data set, we have constructed one network using Cass, which we call the 
CASS-network, and one galled network using the algorithm in [12]. Two conclusions can be drawn from the 
results. Firstly, Cass uses more time than the galled network algorithm. Nevertheless, the time needed by 

Table 2. Results of Cass compared to HybridNumber and HybridInterleave for several 
tions of two input trees with \X\ the number of taxa the two trees have in common, with r the reticulation 
number, k the level and t the running time in hours h, minutes m and seconds s. The averages do not include 
the data sets for which HybridNumber and Cass did not find a solution within two days (denoted "> 2d"). 



Cass can still be considered acceptable for phylogenetic analysis. Secondly, Cass constructs a much simpler 
network in almost all cases. For three data sets, the CASS-network and the galled network have the same 
reticulation number and the same level. For all other data sets, the CASS-network has a significantly smaller 
reticulation number, and also a lower level, than the galled network. 

Table 2 shows the results of applying Cass to practical data. This data set consists of six 
phylogenetic trees of grasses of the Poaceae family, originally published by the Grass Phylogeny Working 
Group [6]. The phylogenetic trees are based on sequences from six different gene loci, ITS, ndhF, phyB, 
rbcL, rpoC and waxy, and contain 47, 65, 40, 37, 34 and 19 taxa respectively. We have compared the results 
of Cass with results of HybridNumber [1], which is a program that computes the minimum number 
of reticulations required to combine two phylogenetic trees (on the same set of taxa) into a phylogenetic 
network that displays both trees. Since HybridNumber can only be used for pairs of trees with identical 
taxon sets, we have also applied Cass to pairs of trees and restricted each of the two trees to the taxa 
that appear in both. Table 2 shows that the running time of Cass is much better than the running time of 
HybridNumber. Very recently, an improved version of HybridNumber, called HybridInterleave [4], 
was published, which is significantly faster than the original program. 

Table 2 shows that Cass requires significantly fewer reticulations than HybridNumber and Hybrid- 
Interleave. This is caused by the fact that the latter programs require that a network displays both input 
trees. The networks constructed by Cass do not necessarily display both input trees, but still represent all 
clusters from both trees, and use fewer reticulations to do so. Other advantages of Cass are that it can also 
be used for cluster sets that are obtained from more than two trees and for cluster sets obtained from trees 
on nonidentical sets of taxa. Moreover, while HybridNumber and HybridInterleave only compute the 
required number of reticulations, Cass also constructs an actual network. See for example Figure 3 for the 
output network of Cass for the ndhF and phyB trees of the Poaceae data set. 

Figure 3. Level-4 network constructed by Cass for the ndhF and phyB trees of the Poaceae grass data set, 
within 1 second. 

6 Discussion 

We have introduced the Cass algorithm, which can be used to combine any set of clusters into a phylogenetic 
network representing those clusters. We have shown that the algorithm performs well on practical 
data. It provides a useful addition to existing software, because it usually constructs a simpler network 
representing the same set of input clusters. Furthermore, we have shown that Cass provides a polynomial- 
time algorithm for deciding whether a level-2 phylogenetic network exists that represents a given set of 
clusters. This algorithm is more useful in practice than algorithms for similar problems that take triplets as 
input [14,15,16,17,19], because the latter algorithms need at least one triplet for each combination of three 
taxa as input, while Cass can be used for any set of input clusters. Furthermore, Cass is also not restricted 
to two input trees on identical taxon sets, as the algorithms in [1,4,18] are. Finally, we remark that Cass can also 
be used when one or more multi-labeled trees are given as input. One can first compute all clusters in the 
multi-labeled tree(s) and subsequently use Cass to find a phylogenetic network representing these clusters. 
Several theoretical problems remain open. First of all, does Cass always construct a minimum-level network, 
even if this minimum is three or more? Secondly, what is the complexity of constructing a minimum-level 
network, if the minimum level k is not fixed but part of the input? Is this problem FPT when parameterized 
by k? Finally, it would be very interesting to design an algorithm that finds a network representing a set 
of input clusters that has a minimum reticulation number. So far, not even a nontrivial exponential-time 
algorithm is known for this problem. 

A Appendix 



Lemma 1. Given a set of clusters $\mathcal{C}$, such that $IG(\mathcal{C})$ is connected and any $X \subset \mathcal{X}$ is separated, Cass(1) and 
Cass(2) construct a simple level-1 respectively a simple level-2 network representing $\mathcal{C}$, if such a network 
exists. 
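
The connectivity hypothesis of the lemma can be checked mechanically. The sketch below assumes the usual definition, which is not restated in this appendix: two clusters are incompatible when they intersect but neither contains the other, and the incompatibility graph has one node per cluster and one edge per incompatible pair. The function names are illustrative. Singleton clusters are never incompatible with anything, so they should be left out before testing connectivity.

```python
from itertools import combinations


def incompatible(c1, c2):
    """True if the clusters overlap but neither contains the other."""
    return bool(c1 & c2) and not (c1 <= c2 or c2 <= c1)


def incompatibility_graph(clusters):
    """Adjacency map with one node per cluster and one edge per incompatible pair."""
    graph = {c: set() for c in clusters}
    for c1, c2 in combinations(clusters, 2):
        if incompatible(c1, c2):
            graph[c1].add(c2)
            graph[c2].add(c1)
    return graph


def is_connected(graph):
    """Depth-first search from an arbitrary node; True if every node is reached."""
    if not graph:
        return True
    start = next(iter(graph))
    seen, stack = {start}, [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return len(seen) == len(graph)


# {a,b} conflicts with {b,c}, which conflicts with {c,d}: one connected component.
chain = [frozenset("ab"), frozenset("bc"), frozenset("cd")]
# {a,b} and {c,d} are disjoint, hence compatible: two isolated nodes.
apart = [frozenset("ab"), frozenset("cd")]
```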

Proof. We start by proving the following claim. We assume that networks do not contain biconnected 
components with only one outgoing edge (because such structures are highly redundant). Furthermore, in this 
proof we identify each leaf with the taxon it is labeled by, to shorten the notation. 

Claim. Given a set of clusters $\mathcal{C}$, such that $IG(\mathcal{C})$ is connected and any $X \subset \mathcal{X}$ is separated, then any 
network N representing $\mathcal{C}$ is simple and no two leaves in N have the same parent. Additionally, if such a 
network N exists, then there also exists a binary simple network N' representing $\mathcal{C}$ which has the same level 
as N and such that no two leaves of N' have the same parent. 

Proof. If N is not simple then it contains a cut-edge $(v_1, v_2)$ such that $v_2$ is not a leaf and some subset of 
the taxa $X' \subset \mathcal{X}$ is reachable by directed paths starting at $v_2$. Since we assume that networks do not contain 
biconnected components with only one outgoing edge, $|X'| \geq 2$. Now, given that $X'$ is below a cut-edge, it 
follows that for every cluster $C \in \mathcal{C}$ it holds that either $X' \subseteq C$ or $C \subseteq X'$. So $X'$ is unseparated, giving us 
an immediate contradiction. To prove the second half of the claim we show how to obtain N' from N by 
expanding edges. First we deal with nodes v that have both indegree and outdegree greater than 1. Here we 
replace the node v by an edge $(v_1, v_2)$ such that the edges incoming to v now enter $v_1$, and the edges outgoing 
from v now exit from $v_2$. Subsequently nodes with indegree at most 1, and outdegree $d \geq 3$, can be replaced 
by a chain of $(d - 1)$ nodes of indegree at most 1 and outdegree 2. Nodes with indegree $d \geq 3$ and outdegree 
1 can be replaced by a chain of $(d - 1)$ nodes of indegree 2 and outdegree 1. These transformations preserve 
the reticulation number of the network and do not introduce any nontrivial cut-edges (i.e. cut-edges that do 
not have a leaf as head), so the resulting network N' is a binary simple network with the same level as N. 
Binary simple networks cannot contain sibling leaves, so we are done. □ 

Before continuing we need some definitions. We say that a tree-edge $e = (v_1, v_2)$ (i.e. an edge where $v_2$ is 
not a reticulation) of a network N is contraction-safe for $\mathcal{C}$ if N represents $\mathcal{C}$ and there is no $C \in \mathcal{C}$ that is 
represented by e. (An edge (u, v) of N is said to represent a cluster C if there exists a tree T on $\mathcal{X}$ that is 
displayed by N, and such that C consists of all taxa reachable by a directed path from v in T.) Clearly such 
an edge can be contracted to obtain a new network N' that still represents $\mathcal{C}$. 
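
The notion of an edge representing a cluster can be made concrete by enumerating all ways of switching reticulation edges on and off. The following sketch assumes a network given as a list of (tail, head) pairs in which nodes without outgoing edges are the taxa; the function name `represented_clusters` and this encoding are illustrative choices, not part of Cass.

```python
from itertools import product


def represented_clusters(edges):
    """Clusters represented by some edge of the network under some switching."""
    incoming = {}
    for u, v in edges:
        incoming.setdefault(v, []).append((u, v))
    tails = {u for u, _ in edges}
    taxa = {v for _, v in edges if v not in tails}
    reticulations = [v for v, ins in incoming.items() if len(ins) > 1]
    tree_edges = [e for e in edges if len(incoming[e[1]]) == 1]

    found = set()
    # A switching keeps exactly one incoming edge of every reticulation.
    for choice in product(*(incoming[r] for r in reticulations)):
        kept = tree_edges + list(choice)
        children = {}
        for u, v in kept:
            children.setdefault(u, []).append(v)

        def below(node):
            """Taxa reachable from `node` using only the kept edges."""
            if node in taxa:
                return frozenset([node])
            return frozenset().union(*(below(c) for c in children.get(node, [])))

        for _, v in kept:
            cluster = below(v)
            if cluster:
                found.add(cluster)
    return found


# A small level-1 network: leaf x hangs below reticulation r with parents u and w.
network = [("root", "u"), ("root", "w"), ("u", "a"), ("u", "r"),
           ("w", "b"), ("w", "r"), ("r", "x")]
found = represented_clusters(network)
```

In this example the network represents both {a, x} and {b, x}, one for each choice of the reticulation edge that is switched on, but not {a, b}.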

We are now ready to prove the lemma. Suppose we are given a set of clusters $\mathcal{C}$, such that $IG(\mathcal{C})$ is 
connected and any $X \subset \mathcal{X}$ is separated. 

It is clear that Cass(0) will return in polynomial time a tree T representing $\mathcal{C}$, if it exists. In this case T 
will be the unique tree that represents $\mathcal{C}$ and which contains no contraction-safe edges. 
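
One way to picture the tree case is the standard construction of a tree from a pairwise compatible cluster set, sketched below. It is not the Cass pseudocode (which is not reproduced in this appendix); the function name `build_tree` and the child-to-parent map it returns are assumptions for illustration. Every non-leaf edge of the resulting tree corresponds to exactly one input cluster, so no edge is contraction-safe.

```python
from itertools import combinations


def build_tree(clusters, taxa):
    """Return a child -> parent map over clusters, or None if no tree exists."""
    nodes = {frozenset(c) for c in clusters}
    nodes |= {frozenset([t]) for t in taxa}   # one leaf per taxon
    nodes.add(frozenset(taxa))                # the root covers all taxa
    for c1, c2 in combinations(nodes, 2):
        if c1 & c2 and not (c1 <= c2 or c2 <= c1):
            return None                       # an incompatible pair rules out a tree
    parent = {}
    for c in nodes:
        supersets = [d for d in nodes if c < d]
        if supersets:
            # In a compatible family the strict supersets of c are nested,
            # so the smallest one is the parent of c.
            parent[c] = min(supersets, key=len)
    return parent


tree = build_tree([frozenset("ab"), frozenset("abc")], "abcd")
no_tree = build_tree([frozenset("ab"), frozenset("bc")], "abc")
```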

We now show that Cass(1) will return in polynomial time a simple level-1 network that represents C, if 
it exists. Suppose then that such a network, N, exists. We assume that N is a binary simple level-1 network. 
Cass will thus at some iteration correctly guess and remove the (unique) leaf x whose parent is a reticulation 
in N. Cass will then construct the unique (and in general non-binary) tree T that represents C \ {x} and 
which contains no contraction-safe edges. To complete the level-1 case it is necessary to show that x can 
be hung back from two edges of T (in the sense of lines 14-16 of the Cass pseudocode) to create a level-1 
network representing C. In [14] it is described how, after removal of leaves, a binary simple level-k network 
always has a topology equal to one of the binary simple level-k generators, depicted in Figure 4. We repeat 
the following definition [14]. 

Definition 1. [14] A simple level-k network N, for $k \geq 1$, is a network obtained by applying the following 
transformation ("leaf hanging") to some simple level-k generator such that the resulting graph is a valid 
network: 

1. replace each edge X by a path and for each internal node v of the path add a new leaf x and an edge 
(v, x); we say that "leaf x is on side X"; and 

2. for each node Y of indegree 2 and outdegree 0 add a new leaf y and an edge (Y, y); we say that "leaf y 
is on side Y". 

Figure 4. The single level-1 generator and the four level-2 generators. 

Consider in particular that N is constructed from the unique level-1 generator 1. There are two cases. In 
the first case N has leaves on both sides A and B, in which case let a (respectively b) be the leaf on side 
A (respectively B) that is furthest from the root. We hang x from the edges in T that feed into a and 
b respectively, to obtain the network N'. We refer to the two corresponding reticulation edges as the a- 
(respectively, b-) reticulation edge. Consider a cluster $C \in \mathcal{C}$. If $\{x, a\} \subseteq C$ or $\{x, a\} \cap C = \emptyset$ then we see that 
N' represents C because T represented $C \setminus \{x\}$ and we can simply "switch on" the a-reticulation edge. If 
$\{x, a\} \cap C = \{a\}$ then $b \notin C$, and $b \notin C \setminus \{x\}$, so we can switch on the b-reticulation edge. If $\{x, a\} \cap C = \{x\}$ 
then C was either a singleton (which is trivially represented) or $b \in C$, in which case we can switch on the 
b-reticulation edge. In the second and final case, assume without loss of generality that only side A contains 
leaves. Let a be defined as before. We hang x back from the edge feeding into a in T, and from a new root 
node that we also connect to the old root of T. Clusters in $\mathcal{C}$ of the form $\{x, a\} \subseteq C$ or $\{x, a\} \cap C = \emptyset$ are 
dealt with as before. If $\{x, a\} \cap C = \{x\}$ then C is a singleton. If $\{x, a\} \cap C = \{a\}$ then we can switch the 
reticulation edge leaving the new root on and we are done. 

There remains only the case that there exists a simple level-2 network representing C, but no tree or 
simple level-1 network. This case is rather complex and requires some extra terminology, although the 
central idea has much in common with the proof for simple level-1 networks that we have just presented. 
Given a network N and reticulation r with a leaf x as child, let N \ {x} denote the network obtained by 
removing the reticulation r and the leaf x from N. We say that a network N is drooped with respect to $\mathcal{C}$ if N 
represents $\mathcal{C}$ and there exists a leaf x of N whose parent is a reticulation, such that the leaf-sets of maximal 
strict subtrees of N \ {x} correspond to the maximal ST-sets of $\mathcal{C} \setminus \{x\}$. 

We are going to first prove that, if there is a simple level-2 network N representing C, then there exists 
a simple level-2 network N' that is drooped w.r.t. C. We do this partially by case analysis on the four 
possible generator topologies for N shown in Figure 4. The general strategy will be to argue that either 
N is already drooped, or that it can be transformed into some new level-2 network N' representing C. If 
N' is subsequently level-1 and/or not simple then we obtain a contradiction. Otherwise, N' is (as we will 
demonstrate) a drooped simple level-2 network. 

Consider any leaf x whose parent is a reticulation of N. Observe that N \ {x} contains exactly one 
reticulation r', below which hangs a strict subtree $T_r$ with leaves $X_r$ (possibly, $|X_r| = 1$ or $|X_r| = 0$). 
Note that, if $X_r$ is the empty set then N contains an edge between two reticulations and contracting this 
edge leads to a network N' that is drooped w.r.t. $\mathcal{C}$, because N' \ {x} is a tree. We distinguish two major cases. 

First major case: $X_r$ is not a maximal ST-set w.r.t. $\mathcal{C} \setminus \{x\}$ and not the empty set. 

In this case it follows that there is some maximal ST-set X that contains $X_r$ and also contains at least 
one leaf $l$ that is not reachable by a directed path from the reticulation of N \ {x}. We can replace $l$ by a 
strict subtree on X that represents $\mathcal{C}|X$. Such a tree exists because X is an ST-set. We remove all leaves in X 
that are not in this strict subtree. Since there are now no leaves left below the reticulation, we can remove 
this reticulation as well. Let N* be the resulting network. It is easy to see that N* is a tree representing 
$\mathcal{C} \setminus \{x\}$. We now require a case-analysis to show how the leaf x can be hung back into N* to obtain a network 
N' that is drooped w.r.t. $\mathcal{C}$. 

Case generator 2a 

Here we assume that the leaf x is equal to the leaf on side F of N. 

Let p be (in N) the leaf on side E that is furthest from the root. Such a leaf will definitely exist by 
assumption that $X_r$ is not empty. The common core of the construction, independent of the exact case, 
requires locating p in N* and hanging x back from the edge that feeds into p. We call this reticulation edge 
the p-edge. Depending on the exact case we will additionally hang back x from one or two other places 
(i.e. to create an indegree-2 or indegree-3 reticulation respectively). But note already that, however we add 
these additional reticulation edges, a cluster $C \in \mathcal{C}$ such that $\{p, x\} \subseteq C$ or $\{p, x\} \cap C = \emptyset$ will definitely 
be represented by N'. The argument for this is identical to that used in the level-1 case. So we only need 
to worry about non-singleton clusters C where $\{p, x\} \cap C = \{p\}$ or $\{p, x\} \cap C = \{x\}$. We hang a second 
reticulation edge of x from the root of N* and call this the root-edge. Now, let $l$ be the leaf on side D (of N) 
that is furthest from the root of N; if such a leaf does not exist then let $l$ be the leaf on side C (of N) that 
is furthest from the root of N, and if that also does not exist then let $l$ be the leaf on side A (of N) that is 
furthest from the root of N. For brevity we will henceforth abbreviate this specification of $l$ to "the lowest 
leaf on sides D; C; A". If $l$ exists (in general it might not) then hang a third reticulation edge of x from the 
edge in N* that feeds into $l$, call this the $l$-edge. Consider then a non-singleton cluster C in $\mathcal{C}$ that contains 
x but not p. Then $l$ exists and cluster C definitely contained it, so $C \setminus \{x\}$ contained $l$ and was represented 
by N*. So in N' (the network we get after adding x below a reticulation) we can switch on the $l$-edge to 
obtain cluster C. Finally, consider a cluster C that contained p, but not x. Then switching on the root-edge 
is sufficient. So N' represents $\mathcal{C}$. 

If N' is a level-1 network then we have a contradiction. Otherwise it is a drooped simple level-2 network 
(because by the earlier claim all networks that represent C are simple). 

Case generator 2b 

Here we assume that the leaf x is the leaf on side G of N. Let $l$ be the lowest leaf on sides D; A of N. 
Let p be the lowest leaf on sides E; F; C; A. We hang x back below a new reticulation of indegree 2 or 3. 
More specifically: we hang x from the edges that feed into $l$, p and h: the leaf on side H. If $l$ and p both 
exist and $l \neq p$ then this reticulation clearly has indegree 3. If $l = p$ or at least one of $l$ and p does not exist 
then we also hang x back from the root. (The only situation when the reticulation has indegree 2 is thus if 
neither $l$ nor p exists). Consider now clusters in $\mathcal{C}$. We distinguish several cases: in the case that neither $l$ 
nor p existed we get a level-1 network (and thus a contradiction), otherwise a drooped level-2 network. 

Suppose that the leaf $l$ existed. Then clusters C of the form $\{x, l\} \subseteq C$ or $\{x, l\} \cap C = \emptyset$ are (as in earlier 
cases) easy to deal with. For a non-singleton cluster C such that $\{x, l\} \cap C = \{x\}$ it holds that $p \in C$ and/or 
$h \in C$. Then in N' we can switch on the p-edge or the h-edge depending on which one is relevant. For a 
non-singleton cluster C such that $\{l, x\} \cap C = \{l\}$ it holds that $\{h, p\} \cap C = \emptyset$, so in N' we can switch the 
h-edge on. 

Suppose, alternatively, that the leaf p existed. Again, "both p and x" and "neither p nor x" clusters are 
easy to deal with. So, consider a cluster that contains x but not p. Then this cluster will contain either $l$ or 
h and we are done. Consider a cluster that contains p but not x. If $l$ exists then it will not be in the cluster, 
so we are done. If $l$ does not exist then we can use the root-edge in N' and we are done. 

The final case is that neither $l$ nor p exists. Clusters that contain h and x, or neither h nor x, are easy to 
deal with. So, consider a non-singleton cluster that contains x but not h. But the sides D, A, E, F, C are all 
empty, meaning that the cluster is a singleton, contradiction. For clusters that contain h but do not contain 
x we can use the root-edge in N', and we are done. 

Case generator 2c 

Here we assume that the leaf x is the leaf on side G. We assume without loss of generality that N contains 
at least three leaves. (If N contains only two leaves then N is clearly already drooped). Let $l$ be the 
lowest leaf (in N) on E; C; A. Let p be the lowest leaf (in N) on F; D; B. Note that (by the assumption that 
there are at least three leaves in N) at least one of $l$ and p will exist. If they both exist then hang x back 
from the edges feeding into $l$, p and h: the leaf on side H. If (without loss of generality) only $l$ exists then 
hang x back from the edges feeding into $l$, h and also the root. Consider then non-singleton clusters in $\mathcal{C}$ that 
contain x but not $l$. Then the cluster either contains h or p and we are done. Finally consider a cluster that 
contains $l$ but not x. If p exists then it will not be in the cluster, so we are done. If p does not exist then we 
can use the root-edge. 

Case generator 2d 

As in case 2a we assume that there is at least one leaf on side E (because otherwise N was already drooped). 
We assume that the leaf x we removed is the leaf on side F of N. Let $l$ be the lowest leaf on side E in N 
(which must exist). Let p be the lowest leaf on side D in N. If p exists then we can hang back x from the 
edges feeding into $l$ and p. Otherwise we hang x back from the edge feeding into $l$ and the root. Consider 
then non-singleton clusters in $\mathcal{C}$ that contain x but not $l$. Then p must exist, and the cluster must contain 
it. For clusters that contain $l$ but not x we can either use the p-reticulation edge (because it is not possible 
to contain $l$ and p but not x) or use the root reticulation edge in N'. N' is thus level-1, and we have a 
contradiction. 

This concludes the case analysis for the generators 2a, 2b, 2c and 2d for the first major case i.e. when 
$X_r$ is not a maximal ST-set or the empty set. 

Second major case: $X_r$ is a maximal ST-set w.r.t. $\mathcal{C} \setminus \{x\}$. 

Here we argue that either N is already drooped w.r.t. $\mathcal{C}$ (i.e. for some reticulation leaf x not only $X_r$, 
but also all other maximal ST-sets of $\mathcal{C} \setminus \{x\}$, correspond to strict subtrees of N \ {x}) or that it is possible 
to transform N into a network N' with this property. We again use a generator-based case analysis. 

To shorten the proofs we introduce abbreviations for several commonly-used concepts. If we say "hang 
x back from $l_1, \ldots, l_i$" ($i \geq 2$) where $l_1, \ldots, l_i \in \mathcal{X}$ we mean: (1) introduce x into the network as the only 
child of a new reticulation $r_x$, (2) for each $l_j$ subdivide the unique edge feeding into $l_j$ to create a new node 
$v_j$, and finally (3) for each $l_j$ we add the reticulation edge $(v_j, r_x)$. If we say "hang x back from $l_1, \ldots, l_i$ 
and the root" ($i \geq 1$) this is defined identically except that after steps (1)-(3) we additionally add an edge 
with head $r_x$ and tail the root. As in earlier proofs we will make heavy use of the fact that, if x is hung 
back from (amongst others) $l_j$ to obtain a network N', then clusters $C \in \mathcal{C}$ for which $C \cap \{l_j, x\} = \{l_j, x\}$ or 
$C \cap \{l_j, x\} = \emptyset$, are easily seen to be represented by N'. That is because in such cases the reticulation edge 
$(v_j, r_x)$ can simply be "switched on". Clusters C of the form $C \cap \{l_j, x\} = \{l_j\}$ or $C \cap \{l_j, x\} = \{x\}$ require a 
little more work and in each case will be verifiable by inspection. 
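
The "hang x back" operation is a purely local edit of the edge list and can be sketched as follows. The edge-list encoding, the function name `hang_back` and the tuple labels used for the new nodes are hypothetical and serve only to illustrate steps (1)-(3) and the optional root edge.

```python
def hang_back(edges, x, targets, from_root=False, root=None):
    """Return a new edge list in which leaf x hangs below a new reticulation.

    edges   -- list of (tail, head) pairs of a tree or network
    targets -- the leaves l_j; the unique edge entering each one is subdivided
    """
    new_edges = list(edges)
    r_x = ("ret", x)                      # hypothetical label for the new reticulation
    for leaf in targets:
        incoming = [e for e in new_edges if e[1] == leaf]
        if len(incoming) != 1:
            raise ValueError("each target leaf needs exactly one incoming edge")
        u = incoming[0][0]
        v_j = ("sub", x, leaf)            # hypothetical label for the subdividing node
        new_edges.remove((u, leaf))
        # Subdivide (u, leaf) at v_j and add the reticulation edge (v_j, r_x).
        new_edges += [(u, v_j), (v_j, leaf), (v_j, r_x)]
    if from_root:
        new_edges.append((root, r_x))     # extra reticulation edge with the root as tail
    new_edges.append((r_x, x))            # x is the only child of r_x
    return new_edges


tree = [("root", "a"), ("root", "b")]
net = hang_back(tree, "x", ["a", "b"])
net_root = hang_back(tree, "x", ["a"], from_root=True, root="root")
```

In both calls the new reticulation ends up with indegree 2: once from two subdivided leaf edges, once from one subdivided leaf edge plus the root.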

If we "tidy up" a network we repeatedly apply the following five steps until none is applicable: (1) delete 
unlabeled nodes with outdegree 0; (2) suppress nodes with indegree and outdegree 1 (i.e. contract one edge 
incident to the node); (3) replace multiple edges by single edges; (4) remove the root if it has outdegree 1; 
and (5) contract biconnected components that have only one outgoing edge. Note that tidying up does not 
affect the set of clusters that a network represents and is simply a housekeeping measure. 

If we say "move maximal ST-sets below cut-edges" we refer to the following (fundamental) procedure. 
Suppose N \ {x} represents $\mathcal{C} \setminus \{x\}$. Suppose there is a non-singleton maximal ST-set S of $\mathcal{C} \setminus \{x\}$ that does 
not correspond to a strict subtree of N \ {x}. Then we can pick any leaf $l$ of S, delete all leaves $l' \in S \setminus \{l\}$ 
from N \ {x}, replace $l$ with the unique tree that represents precisely those clusters in $\mathcal{C}|S$, and tidy up. 
This creates a new network in which S does appear as a strict subtree and (because S is an ST-set) still 
represents $\mathcal{C} \setminus \{x\}$. We can repeat this process until all such S appear as strict subtrees of the final network. 
In other words, until every maximal ST-set is equal to the set of leaves reachable from some cut-edge. Note 
that, crucially, this procedure will not affect singleton maximal ST-sets or (in this case) $X_r$ because these 
already correspond to strict subtrees of the network. 
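
The procedure relies on two notions that are defined earlier in the paper and not repeated here: the restriction of a cluster set to a set of taxa S, and the ST-set property. The sketch below assumes that S is an ST-set when it is compatible with every cluster and the clusters restricted to S are pairwise compatible; if the definition used in the main text differs, the check should be adapted. All function names are illustrative.

```python
from itertools import combinations


def compatible(c1, c2):
    """Clusters are compatible if they are disjoint or one contains the other."""
    return not (c1 & c2) or c1 <= c2 or c2 <= c1


def restrict(clusters, s):
    """Restriction of a cluster set to the taxa in s (empty results are dropped)."""
    return {c & s for c in clusters if c & s}


def is_st_set(s, clusters):
    """Assumed definition: s is compatible with every cluster, and the
    clusters restricted to s are pairwise compatible."""
    if not all(compatible(s, c) for c in clusters):
        return False
    return all(compatible(c1, c2)
               for c1, c2 in combinations(restrict(clusters, s), 2))


example = [frozenset("ab"), frozenset("abc"), frozenset("cd")]
nested = [frozenset("ab"), frozenset("bc"), frozenset("abcd")]
```

With `example`, the set {a, b} passes the check, whereas {b, c} overlaps {a, b} without containment and fails. With `nested`, the set {a, b, c} is compatible with every cluster but still fails, because {a, b} and {b, c} conflict inside it.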

We say "remove x and transform" to refer to the combined process of removing x and its parent (from 
N) to obtain N \ {x}, tidying this network up and subsequently moving all maximal ST-sets (of $\mathcal{C} \setminus \{x\}$) 
below cut-edges. 

Case generator 2a 

Let f be the leaf on side F. Let a, b, d, e be the leaf on respectively side A, B, D, E that is furthest from 
the root. Let c be the leaf on side C that is closest to the root. Leaf e must exist, because otherwise N 
was already drooped. We take x = f. We distinguish two subcases. In one subcase we construct a drooped 
network N' by removing x and transforming, and then hanging x back in such a way that a network is 
created that represents all clusters in $\mathcal{C}$. This network will be drooped (because, prior to hanging x back, 
we moved the maximal ST-sets of $\mathcal{C} \setminus \{x\}$ under cut-edges) and thus we are done. In the second case we 
will show that N was already drooped. For the first case, suppose that leaf d exists. It is easy to see that 
(after removing x and transforming) hanging x back from e and d creates a drooped network w.r.t. $\mathcal{C}$; the 
argumentation (e.g. regarding the four possibilities for $C \cap \{e, x\}$ for each $C \in \mathcal{C}$) is identical to that used 
in the previous proofs, and we are done. If d does not exist, and c does not exist, but a does exist, then we 
hang x back from e and a, done. If none of d, c, a exist we hang x back from e and the root, done. This leaves 
us with the case that d does not exist but c does exist. We observe that hanging x back from e and c creates 
a drooped network N' that is consistent with all clusters, with the possible exception of a cluster C 
that (in N) contains c and all leaves on side E, but not f or any leaves from side A. If such a cluster does 
not exist then we are done. 

Assume, then, that it does exist. Observe that if we had hung x back from e, c and the root then we 
would have created a (potentially level-3) drooped network w.r.t. $\mathcal{C}$. Let N* be the network obtained by 
tidying up N \ {x}. We observe that $\mathcal{C} \setminus \{x\}$ contains at most one non-singleton maximal ST-set that is not 
equal to $X_r$. Suppose the opposite was true i.e. that $\mathcal{C} \setminus \{x\}$ contained at least two non-singleton maximal 
ST-sets not equal to $X_r$. If we then moved maximal ST-sets under cut-edges we would create a network with 
at least two nontrivial cut-edges (excluding the cut-edge associated with the strict subtree corresponding to 
$X_r$). Hanging x back from e, c and the root in this network would create a network N' that represents $\mathcal{C}$ but 
which contains at least one nontrivial cut-edge. But the set of leaves reachable by a directed path from such 
a nontrivial cut-edge forms an unseparated subset of $\mathcal{X}$, which by assumption is not possible. Now, if $\mathcal{C} \setminus \{x\}$ 
contains no non-singleton maximal ST-sets not equal to $X_r$, then we are immediately done, because N was 
already drooped. So we conclude that there is exactly one non-singleton maximal ST-set S of $\mathcal{C} \setminus \{x\}$ that is 
not equal to $X_r$, and that this must contain c. Note that, because of the existence of cluster C (in particular 
the fact that cluster C contains leaves from side E, and that $X_r$ is already a maximal ST-set), S must be 
entirely contained within the leaves of side C (in N). S must thus contain at least two leaves $c_1, c_2$ on side 
C. Let $c_1$ and $c_2$ be the leaves in S that are furthest from the root, with $c_2$ furthest away. Some cluster 
$C'$ separated $c_1$ from $c_2$ in $\mathcal{C}$ and this proves that $c_1$ and $c_2$ are the last two leaves on side C. (Otherwise 
$C' \setminus \{x\}$ would prevent S from being a maximal ST-set of $\mathcal{C} \setminus \{x\}$). We conclude (again, by the separation 
of $c_1$ and $c_2$) that $C'$ contained leaves from side E. But then $C' \setminus \{x\}$ prevents S from being a maximal 
ST-set of $\mathcal{C} \setminus \{x\}$. We conclude thus that $\mathcal{C} \setminus \{x\}$ actually contains no non-singleton maximal ST-sets, with 
the possible exception of $X_r$. So N was already drooped. 

Case generator 2b 

Let g, h be the leaves on sides G, H respectively. Let a be the leaf on side A furthest from the root, and 
define b, ..., f similarly. If f exists then take x = h. We remove x and transform and then hang x back 
from f and b (if b exists) or otherwise from f and the root. A simple case-analysis shows that the resulting 
network is drooped. So assume that side F has no leaves. If leaf e exists then take x = g, remove x and 
transform as above, and hang x back from e and d if they both exist, otherwise e and the root. This again 
gives a network that is drooped w.r.t. $\mathcal{C}$. So assume side E also contains no leaves. Suppose that side C 
contains no leaves. Then we take x = h and (after removing x and transforming) hang x back from b and 
g (if b exists) and otherwise from g and the root. So assume there is at least one leaf on side C. If c is the only 
leaf on side C and none of the clusters {c, g}, {c, h}, {c, g, h} are in $\mathcal{C}$, then we can safely move c to the 
top of side D, and we are back in the case when side C contains no leaves, done. If side C contains more 
than one leaf then at least one of {c, g}, {c, h}, {c, g, h} has to be in $\mathcal{C}$, because otherwise c is unseparated 
from the leaf immediately above it. If $\{c, g\} \in \mathcal{C}$ then take x = h, otherwise take x = g. First suppose 
x = h. Observe that {c} is a maximal ST-set in $\mathcal{C} \setminus \{x\}$ because $\{c, g\} \in \mathcal{C} \setminus \{x\}$ and (by assumption) {g} 
is a maximal ST-set in $\mathcal{C} \setminus \{x\}$. Thus, when we remove x and transform, then neither c nor g will move 
in the sense of moving maximal ST-sets under cut-edges. This is a critical fact. So, after the removal of 
x and transformation the path of length two in N between the parent c' of c and the parent g' of g will 
have been suppressed (by the tidying up) to become a single edge (c', g'). We will now further expand the 
current notion of "hanging x back", which already defines "hanging back" from leaves and the root, to also 
define hanging back from an edge (u, v). When we hang x back from the edge (u, v) we subdivide (u, v) to 
create the new node u' and add the reticulation edge $(u', r_x)$ (where $r_x$ is defined as earlier). We will hang x 
back from b and (c', g') (if b exists) and otherwise from (c', g') and the root. By inspection it can be verified 
that this gives a drooped network that represents $\mathcal{C}$. Symmetrically, if x = g then we hang x back from d 
(if it exists, otherwise the root) and (c', h') where c' is the parent of c and h' is the parent of h (because, 
again, {c} and {h} are maximal ST-sets that do not move in the sense of moving maximal ST-sets under 
cut-edges). Again we obtain a drooped network w.r.t. $\mathcal{C}$, and we are done. Note that the added complexity 
of this proof comes from the possible presence of clusters {g, h} in the input: this is why we have to hang 
one of the two reticulation edges of x from a carefully identified edge, rather than (as usual) a leaf or the root. 

Case generator 2c 

Assume a, ..., g, h are defined as in case 2b. Let $l$ be the lowest leaf (in N) on E; C; A. Let p be the 
lowest leaf (in N) on F; D; B. At least one of $l$ and p will exist, because we assume that N has at least 
three leaves. (Otherwise it is already drooped). Suppose $\{g, h\} \notin \mathcal{C}$. Then we can take x = g, remove x and 
transform, and then hang x back from $l$ and p (if they both exist), or from $l$ and the root (if only $l$ exists), or 
from p and the root (if only p exists). This will give a drooped network w.r.t. $\mathcal{C}$. So assume $\{g, h\} \in \mathcal{C}$. We assume 
then, without loss of generality, that sides D and F contain no leaves. Suppose that side B also contains 
no leaves. In this case we take x = g, remove x and transform, then hang x back from $l$ (which must exist) 
and the edge (root, h') where h' is the parent of h. (Again, this edge was originally a path of length 2 in 
N that was subsequently suppressed by the tidying-up operation). This edge is definitely present because 
{h} is (by assumption) a maximal ST-set of $\mathcal{C} \setminus \{x\}$ and thus remains unaffected by the moving of maximal 
ST-sets under cut-edges. This creates a drooped network. So we assume that side B contains at least one 
leaf i.e. that leaf b exists. If b is the only leaf on side B and none of the clusters {b, g}, {b, h}, {b, g, h} are 
in $\mathcal{C}$ then we could move b to the top of side A and we are back in the case that side B contains no leaves, 
done. If side B contains more than one leaf then at least one of those clusters must be present, otherwise 
b and the leaf immediately above it were unseparated. If $\{b, h\} \in \mathcal{C}$ take x = g, otherwise take x = h. 
First suppose x = g. As in case 2b we argue that {b} and {h} are maximal ST-sets in $\mathcal{C} \setminus \{x\}$ and thus 
that the edge (i.e. suppressed path) connecting the parent b' of b to the parent h' of h is unaffected by the 
movement of maximal ST-sets. In this case we hang x back from (b', h') and from $l$ (if it exists; otherwise 
the root). It is not too difficult to verify that the resulting network is drooped w.r.t. $\mathcal{C}$. The case x = h is almost 
entirely symmetrical except that we must redefine $l$ to be the lowest leaf on C; E; A (instead of E; C; A). 

Case generator 2d 

Take x = f, where f is the leaf on side F. Let e be the leaf on side E that is furthest from the root. 
(Leaf e must exist because otherwise $X_r = \emptyset$). Let d be the leaf on side D that is furthest from the root. We 
remove x and transform, and hang x back from e and d (if d exists) and otherwise from e and the root. This 
creates a drooped network. 



This concludes the second major case, and we have thus proven that a drooped simple level-2 network 
N' exists that represents C. To complete the overall proof we need to show that Cass will (re)construct N' , 
or some other simple level-2 network representing C. Suppose we contract all contraction-safe edges of N' 
to obtain N" . In some iteration, Cass correctly identifies a leaf x whose parent is a reticulation r and the 
maximal ST-set $X_r$ which is the set of leaves below the only reticulation r' of N" \ {x}. 

Let T" be the tree obtained by removing x, r, $X_r$ and r' from N" and contracting edges entering unlabeled 
tree-nodes with outdegree at most 1. We first show that T" is identical to the unique tree T that represents 
$\mathcal{C} \setminus (X_r \cup \{x\})$ and which contains no contraction-safe edges. (T is the tree that Cass constructs in its 
innermost iteration). Suppose $T'' \neq T$. Then T" will contain at least one contraction-safe edge, implying 
that N" also contains at least one contraction-safe edge, yielding a contradiction. 

If $X_r \neq \emptyset$, one can reconstruct N" from T" as follows. First add a tree representing $\mathcal{C}|X_r$ below a 
reticulation and subsequently add x below another reticulation. Notice however that Cass always adds the 
reticulations below nodes inserted into edges, while in N" a reticulation might be a child of a node v with 
indegree one and outdegree at least two (observe that v cannot be a reticulation because $X_r \neq \emptyset$ and that v 
cannot have indegree 0 because Cass adds a dummy root with an edge to the old root). Cass adds the new 
reticulation below a node inserted into the edge entering v instead of below v, which leads to a network N'" 
that also represents $\mathcal{C}$. 

To conclude the proof, consider the case $X_r = \emptyset$. Assume without loss of generality that N" contains one 
reticulation with indegree 3 (if it contains two reticulations with indegree 2 then we can contract the edge 
between these reticulations). In this case, Cass constructs a network N'" representing C from T" as follows. 
It first adds a dummy leaf d below a reticulation, then it adds x below another reticulation. This second 
reticulation is added below nodes inserted into the edge entering d and one other edge. Before outputting 
the network, Cass removes d and contracts the edges between the two reticulations. As in the previous 
case, whenever in N" the reticulation hangs below a node v with outdegree at least two, Cass hangs the 
reticulation below a node inserted into the edge entering v instead. The resulting network N'" represents C. 
This concludes the proof. □ 



Lemma 2. Cass runs in time $O(|\mathcal{X}|^{3k+2} \cdot |\mathcal{C}|)$, if k is fixed. 



Proof. Let n = \X\ and m — \C\. We analyze the running time of constructing a simple level-< k network, 
since all other computations can clearly be done in 0{n^''~^^ ■ m) time. We will show by induction on k' 
that a call to Cass(C, A", fc, A:') takes at most 0{n^^ """^ • to) time and returns at most 0{n^'^ ) networks, for 
fixed k' . The lemma will follow because in the original call k' = k. 

For $k' = 0$, Cass only checks if there exists a tree representing $\mathcal{C}$, which can clearly be done in $O(n^2 \cdot m)$
time and leads to at most one network. Suppose $k' \geq 1$. The algorithm loops through $O(n)$ taxa and at
most $O(n^{3(k'-1)})$ recursively created networks. For each network, the algorithm loops through all pairs of
edges. For fixed $k'$, each network contains $O(n)$ edges, since a tree contains at most $2n - 2$ edges and the
algorithm adds a constant number of edges in each iteration. For each combination of edges, Cass checks if
the constructed network represents all clusters. This can be done by looping through the at most $2^k$ ways
of switching edges on and off and checking if all $m$ clusters are represented by one of the resulting trees,
in $O(2^k \cdot m \cdot n^2)$ time. This is the bottleneck of all computations. Thus, the total time needed by Cass
is $O(n \cdot n^{3(k'-1)} \cdot n^2 \cdot 2^k \cdot m \cdot n^2)$, which is $O(n^{3k'+2} \cdot m)$ for fixed $k'$. Similarly, the number of constructed
networks is at most $O(n \cdot n^{3(k'-1)} \cdot n^2)$ and hence $O(n^{3k'})$. □
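
The representation check used in this proof can be illustrated with a small Python sketch. It is not the Cass or Dendroscope implementation. It assumes that a network is given as a list of (parent, child) edges without parallel edges, and that a cluster is represented when, for some choice of exactly one switched-on incoming edge per reticulation, it equals the set of taxa reachable below some edge. The function names and the example network are hypothetical.

```python
from itertools import product


def represented_clusters(edges, leaves):
    """Collect every cluster obtained under some switching of the network.

    edges  : list of (parent, child) pairs of a rooted acyclic network
    leaves : set of node names that are taxa
    """
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)
    # Reticulations are the nodes with more than one incoming edge.
    reticulations = sorted(v for v, ps in parents.items() if len(ps) > 1)

    clusters = set()
    # One switching = one "on" incoming edge chosen for each reticulation.
    for choice in product(*(parents[r] for r in reticulations)):
        on_parent = dict(zip(reticulations, choice))
        # Keep only the edges that are switched on.
        on_edges = [(u, v) for u, v in edges
                    if v not in on_parent or on_parent[v] == u]
        children = {}
        for u, v in on_edges:
            children.setdefault(u, []).append(v)

        def taxa_below(node):
            if node in leaves:
                return frozenset([node])
            found = frozenset()
            for child in children.get(node, []):
                found |= taxa_below(child)
            return found

        for _, v in on_edges:
            below = taxa_below(v)
            if below:  # skip nodes whose subtree lost all its taxa
                clusters.add(below)
    return clusters


def represents(edges, leaves, cluster_set):
    """True if every cluster in cluster_set appears under some switching."""
    found = represented_clusters(edges, leaves)
    return all(frozenset(c) in found for c in cluster_set)


# Hypothetical network with a single reticulation h whose parents are p and q.
example_edges = [("root", "p"), ("root", "q"), ("p", "a"), ("p", "h"),
                 ("q", "h"), ("q", "c"), ("h", "b")]
example_leaves = {"a", "b", "c"}
print(represents(example_edges, example_leaves, [{"a", "b"}, {"b", "c"}]))
```

When every reticulation has indegree 2, the outer loop runs over exactly $2^k$ switchings for $k$ reticulations, which is the factor that appears in the running-time bound above.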

Lemma 3. For each $r \geq 2$, there exists a set $\mathcal{C}_r$ of clusters such that there exists a network with two
reticulations that represents $\mathcal{C}_r$ while any galled network representing $\mathcal{C}_r$ contains at least $r$ reticulations.

Proof. Consider a simple level-2 network $N$ of type 2a, with $r$ leaves on each of the sides B, C and E and
a single leaf on side F. Let $\mathcal{C}_r$ be the set of all clusters represented by $N$. Suppose that there exists a galled
network $N'$ representing $\mathcal{C}_r$ and containing $r' < r$ reticulations. It is easy to check that the incompatibility
graph of $\mathcal{C}_r$ (excluding singleton clusters) is connected and hence that $N'$ contains just one biconnected
component (except for the leaves). Thus, the $r'$ reticulations of $N'$ each have a leaf as child. Let $\mathcal{C}'$ be the
result of removing the $r'$ taxa labeling these leaves from $\mathcal{C}_r$. It follows that there exists a tree representing $\mathcal{C}'$
and hence that $\mathcal{C}'$ is compatible. However, since fewer than $r$ taxa were removed and each of the sides B, C
and E of $N$ carries $r$ leaves, $\mathcal{C}'$ still contains at least one leaf on each of these sides, say a leaf $b$ on side B,
$c$ on side C and $e$ on side E. Hence, there will be a cluster $X \in \mathcal{C}'$
containing $b$ and $e$ but not $c$ and a cluster $Y \in \mathcal{C}'$ containing $c$ and $e$ but not $b$. It follows that $X$ and $Y$ are
incompatible, which is a contradiction because we have already shown that $\mathcal{C}'$ is compatible. □
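
The contradiction at the end of this proof rests on the pairwise compatibility test for clusters. A minimal Python sketch of that test follows, assuming the usual definition that two clusters are compatible when they are disjoint or one contains the other. The function names and the taxon labels are illustrative and not taken from the paper's software.

```python
def compatible(x, y):
    """Two clusters are compatible if they are disjoint or nested."""
    x, y = set(x), set(y)
    return x.isdisjoint(y) or x <= y or y <= x


def is_compatible_set(clusters):
    """A set of clusters is compatible if all of its pairs are compatible."""
    cs = [set(c) for c in clusters]
    return all(compatible(a, b)
               for i, a in enumerate(cs)
               for b in cs[i + 1:])


# Clusters shaped like X and Y in the proof: both contain e,
# X contains b but not c, and Y contains c but not b.
X = {"b", "e"}
Y = {"c", "e"}
print(compatible(X, Y))
```

Because `X` and `Y` share `e` while neither contains the other, the test fails for this pair, so no set of clusters containing both can be represented by a tree.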



